Multi-cursor transcription editing

ABSTRACT

A device, for use by a transcriptionist in a transcription editing system for editing transcriptions dictated by speakers, includes, in combination, a monitor configured to display visual text of transcribed dictations, an audio mechanism configured to cause playback of portions of an audio file associated with a dictation, and a cursor-control module coupled to the audio mechanism and to the monitor and configured to cause the monitor to display multiple cursors in the text.

BACKGROUND OF THE INVENTION

Healthcare costs in the United States account for a significant share of the GNP. The affordability of healthcare is of great concern to many Americans. Technological innovations offer important leverage for reducing healthcare costs.

Many healthcare institutions require doctors to keep accurate and detailed records concerning diagnosis and treatment of patients. Motivations for keeping such records include government regulations (such as Medicare and Medicaid regulations), desire for the best outcome for the patient, and mitigation of liability. The records include patient notes that reflect information that a doctor or other person adds to a patient record after a given diagnosis, patient interaction, lab test or the like.

Record keeping can be a time-consuming task, and the physician's time is valuable. The time required for a physician to hand-write or type patient notes can represent a significant expense. Verbal dictation of patient notes offers significant time savings to physicians, and is becoming increasingly prevalent in modern healthcare organizations.

Over time, a significant industry has evolved around the transcription of medical dictation. Several companies produce special-purpose voice mailbox systems for storing medical dictation. These centralized systems hold voice mailboxes for a large number of physicians, each of whom can access a voice mailbox by dialing a phone number and putting in his or her identification code. These dictation voice mailbox systems are typically purchased or shared by healthcare institutions. Prices can be over $100,000 per voice mailbox system. Even at these prices, these centralized systems save healthcare institutions vast sums of money over the cost of maintaining records in a more distributed fashion.

Using today's voice mailbox medical dictation systems, when a doctor completes an interaction with a patient, the doctor calls a dictation voice mailbox, and dictates the records of the interaction with the patient. The voice mailbox is later accessed by a medical transcriptionist who listens to the audio and transcribes the audio into a text record. The playback of the audio data from the voice mailbox may be controlled by the transcriptionist through a set of foot pedals that mimic the action of the “forward”, “play”, and “rewind” buttons on a tape player. Should a transcriptionist hear an unfamiliar word, the standard practice is to stop the audio playback and look up the word in a printed dictionary.

The medical transcriptionist's time is less costly for the hospital than the doctor's time, and the medical transcriptionist is typically much more familiar with the computerized record-keeping systems than the doctor is, so this system offers a significant overall cost saving to the hospital.

Expedient processing of doctor's dictation is often desirable so that records can be passed between one part of a healthcare institution and another (such as from Radiology to Surgery), or so that records can be passed to another institution if the next step in a patient's care requires that the patient be moved to another facility. In addition to timeliness, accuracy of medical transcriptions is of paramount importance. A mistake in a medical transcription could mean the difference between life and death. In transcribing doctor's orders for such procedures as chemotherapy and radiation therapy for cancer patients, an elaborate system of double-checking by separate people is standard to mitigate risk.

SUMMARY OF THE INVENTION

In general, in an aspect, the invention provides a device for use by a transcriptionist in a transcription editing system for editing transcriptions dictated by speakers, the device including, in combination, a monitor configured to display visual text of transcribed dictations, an audio mechanism configured to cause playback of portions of an audio file associated with a dictation, and a cursor-control module coupled to the audio mechanism and to the monitor and configured to cause the monitor to display multiple cursors in the text.

Implementations of the invention may include one or more of the following features. The cursor-control module is configured to cause the monitor to display multiple cursors in the text that indicate different functionality. The cursor-control module is configured to cause the monitor to display an audio cursor accentuating a portion of the text, the audio cursor accentuating different text as the audio file is played using the audio mechanism, and a text cursor indicative of a position in the text where editing commands will be implemented. The audio cursor comprises at least one of a rectangular box surrounding text corresponding to a portion of the audio file, a rectangular box surrounding a line of text, a vertical line, an inverse-video portion of the monitor, and bolding of a portion of the text. The cursor-control module is configured to determine where to cause the monitor to display the audio cursor by using a token-alignment file that associates portions of the audio file with portions of the text. The cursor-control module is configured to move at least one of the audio cursor and the text cursor to a location of the other of the text cursor and the audio cursor, respectively. The audio mechanism is configured to determine and play a portion of the audio file corresponding to text at the location of the audio cursor when the audio cursor is moved to the location of the text cursor. The device further includes a change-recording apparatus configured to record changes made to the text and associate the changes with portions of the audio file whereby the recorded changes can be used to adapt speech recognition apparatus in accordance with the changed text and the associated portions of the audio file.

In general, in another aspect, the invention provides a computer program product residing on a computer-readable medium and including computer-readable instructions for causing a computer to display visual text of transcribed dictations, cause playback of portions of an audio file associated with a dictation, and cause a monitor to display multiple cursors in the text.

Implementations of the invention may include one or more of the following features. The instructions are configured to cause the monitor to display an audio cursor accentuating a portion of the text with the audio cursor accentuating different text as the audio file is played, and a text cursor indicative of a position in the text where editing commands will be implemented. The cursor-control module is configured to determine where to cause the monitor to display the audio cursor by using a token-alignment file that associates portions of the audio file with portions of the text. The computer program product further includes instructions for causing the computer to move at least one of the audio cursor and the text cursor to a location of the other of the text cursor and the audio cursor, respectively. The computer program product further includes instructions for causing the computer to determine and cause playing of a portion of the audio file corresponding to text at the location of the audio cursor when the audio cursor is moved to the location of the text cursor. The computer program product further includes instructions for causing the computer to record changes made to the text and associate the changes with portions of the audio file whereby the recorded changes can be used to adapt speech recognition apparatus in accordance with the changed text and the associated portions of the audio file.

In general, in another aspect, the invention provides a method of processing text transcribed from an audio file, the method including displaying text of a transcribed dictation on a monitor, playing portions of an audio file associated with the dictation, displaying an audio cursor in the text on the monitor, the audio cursor accentuating a portion of the text with the audio cursor accentuating different text as the audio file is played, and displaying a text cursor in the text on the monitor, the text cursor being indicative of a position in the text where editing commands will be implemented.

Implementations of the invention may include one or more of the following features. The method further includes using a token-alignment file that associates portions of the audio file with portions of the text to determine where to display the audio cursor. The method further includes moving at least one of the audio cursor and the text cursor to a location of the other of the text cursor and the audio cursor, respectively, in response to receiving a corresponding command. The method further includes playing a portion of the audio file corresponding to text at the location of the audio cursor if the audio cursor is moved to the location of the text cursor. The method further includes recording changes made to the text, and associating the changes with portions of the audio file. The method further includes using the recorded changes to adapt speech recognition apparatus in accordance with the changed text and the associated portions of the audio file.

In general, in another aspect, the invention provides a method of processing a recorded dictation, the method including analyzing the recorded dictation in accordance with speech models to convert the recorded dictation to a draft text, storing the draft text, and producing and recording a token-alignment file that associates portions of the draft text with portions of the audio file, the token-alignment file including tokens at least some of which are indicative of portions of the draft text, the tokens indicating beginnings and ends of portions of the recorded dictation associated with the portions of the draft text such that the portions of the recorded dictation are associated with corresponding portions of the draft text even if the corresponding portions of the draft text, if spoken, do not correspond identically to the corresponding portions of the recorded dictation.

Implementations of the invention may include one or more of the following features. Producing and recording the token-alignment file includes producing and recording tokens for which there is no corresponding draft text. The method further includes receiving a revised text associated with the recorded dictation, and using indicia of differences between the revised text and the draft text and the associated recorded dictation to modify the speech models for converting other recorded dictations to other draft texts.

Various aspects of the invention may provide one or more of the following capabilities. The cost of medical transcription can be reduced and/or the accuracy of medical transcription increased. The expediency and turn-around time of medical transcription can be improved. Editing of transcriptions can be performed faster than with previous techniques. Transcribed text can be edited during playback of transcribed audio. Text other than that associated with audio currently being played can be edited without stopping playback of audio associated with a text document. Transcribed text can be selected and its corresponding audio played, e.g., regardless of a current portion of audio being played or having last been played. Transcriptionist productivity can be improved. Transcriptionist fatigue can be reduced.

These and other capabilities of the invention, along with the invention itself, will be more fully understood after a review of the following figures, detailed description, and claims.

BRIEF DESCRIPTION OF THE FIGURES

FIG. 1 is a simplified diagram of a system for transcribing dictations and editing corresponding transcriptions.

FIG. 2 is a simplified block diagram of an editing device of the system shown in FIG. 1.

FIGS. 3-5 are portions of a transcribed document showing exemplary embodiments of audio and text cursors.

FIG. 6 is a block flow diagram of a process of producing and editing a transcription.

FIG. 7 is a block flow diagram of a process of reviewing a draft transcribed document.

FIG. 8 is a block flow diagram of a process of editing the draft transcribed document.

DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS

Embodiments of the invention can provide multiple cursors for use in editing text documents each of which is associated with a digital audio signal of speech to be transcribed. An audio cursor is provided that highlights text associated with corresponding audio being played. The audio cursor tracks the audio signal to help the transcriptionist follow along visually with the text as the associated audio plays. A text cursor can be manipulated independently of the audio cursor by a transcriptionist. The text cursor indicates the location at which edits, e.g., entered through a keyboard, are applied to the transcribed text. The text cursor can be positioned and edits made to the text at its location, and/or the audio cursor can be made to coincide with the text cursor so that the corresponding audio is played. Using embodiments of the invention, a transcriptionist can process multi-modal inputs and reduce the amount of time the transcriptionist would use to review and revise draft documents using previous techniques. Other embodiments are within the scope of the invention.

Referring to FIG. 1, a system 10 for transcribing audio and editing transcribed audio includes a speaker/person 12, a communications network 14, a voice mailbox system 16, an administrative console 18, an editing device 20, a communications network 22, a database server 24, a communications network 26, and an automatic transcription device 30. Here, the network 14 is preferably a public switched telephone network (PSTN) although other networks, including packet-switched networks, could be used, e.g., if the speaker 12 uses an Internet phone for dictation. The network 22 is preferably a packet-switched network such as the global packet-switched network known as the Internet. The network 26 is preferably a packet-switched, local area network (LAN). Other types of networks may be used, however, for the networks 14, 22, 26, or any or all of the networks 14, 22, 26 may be eliminated, e.g., if items shown in FIG. 1 are combined or eliminated.

Preferably, the voice mailbox system 16, the administrative console 18, and the editing device 20 are situated “off site” from the database server 24 and the automatic transcription device 30. These systems/devices 16, 18, 20, however, could be located “on site,” with communications between them taking place, e.g., over a local area network. Similarly, it is possible to locate the automatic transcription device 30 off-site, and have the device 30 communicate with the database server 24 over the network 22.

The network 14 is configured to convey dictation from the speaker 12 to the voice mailbox system 16. Preferably, the speaker 12 dictates into an audio transducer such as a telephone, and the transduced audio is transmitted over the telephone network 14 into the voice mailbox system 16, such as the Intelliscript™ product made by eScription™ of Needham, Mass. The speaker 12 may, however, use means other than a standard telephone for creating a digital audio file for each dictation. For example, the speaker 12 may dictate into a handheld PDA device that includes its own digitization mechanism for storing the audio file. Or, the speaker 12 may use a standard “dictation station,” such as those provided by many vendors. Still other devices may be used by the speaker 12 for dictating, and possibly digitizing the dictation, and sending it to the voice mailbox system 16.

The voice mailbox system 16 is configured to digitize audio from the speaker 12 to produce a digital audio file of the dictation. For example, the system 16 may use the Intelliscript™ product made by eScription.

The voice mailbox system 16 is further configured to prompt the speaker 12 to enter an identification code and a worktype code. The speaker 12 can enter the codes, e.g., by pressing buttons on a telephone to send DTMF tones, or by speaking the codes into the telephone. The system 16 may provide speech recognition to convert the spoken codes into a digital identification code and a digital worktype code. The mailbox system 16 is further configured to store the identification code and the worktype code in association with the dictation. The system 16 preferably prompts the speaker 12 to provide the worktype code at least for each dictation related to the medical field. The worktype code designates a category of work to which the dictation pertains, e.g., for medical applications this could include Office Note, Consultation, Operative Note, Discharge Summary, Radiology Report, etc.

The voice mailbox system 16 is further configured to transmit the digital audio file and speaker identification code over the network 22 to the database server 24 for storage. This transmission is accomplished by the system 16 using standard network transmission protocols to communicate with the database server 24.

The database server 24 is configured to store the incoming data from the voice mailbox system 16, as well as from other sources. The database server 24 may include the EditScript Server™ database product from eScription. Software of the database server is configured to produce a database record for the dictation, including a file pointer to the digital audio data, and a field containing the identification code for the speaker 12. If the audio and identifying data are stored on a PDA, the PDA may be connected to a computer running the HandiScript™ software product made by eScription that will perform the data transfer and communication with the database server 24 to enable a database record to be produced for the dictation.

Preferably, all communication with the database server 24 is intermediated by a “servlet” application 32 that includes an in-memory cached representation of recent database entries. The servlet 32 is configured to service requests from the voice mailbox system 16, the automatic transcription device 30, the editing device 20, and the administrative console 18, reading from the database when the servlet's cache does not contain the required information. The servlet 32 includes a separate software module that helps ensure that the servlet's cache is synchronized with the contents of the database. This helps relieve the database of much of the real-time data communication and allows it to grow much larger than otherwise possible. For simplicity, however, the below discussion does not refer to the servlet, but all database access activities may be realized using the servlet application 32 as an intermediary.
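
By way of illustration only, the caching behavior described for the servlet 32 may be sketched as a read-through cache. The class name, the method names, and the use of a write-through update to keep the cache synchronized are assumptions made for this sketch; the specification does not state how the synchronization module operates.

```python
class CachingServlet:
    """Hypothetical read-through cache in front of a database (here a dict)."""

    def __init__(self, database):
        self._db = database      # stand-in for the database of the server 24
        self._cache = {}         # in-memory copies of recent entries
        self.db_reads = 0        # counts how often the database was consulted

    def get(self, key):
        # Read from the database only when the cache lacks the entry.
        if key not in self._cache:
            self.db_reads += 1
            self._cache[key] = self._db[key]
        return self._cache[key]

    def put(self, key, value):
        # Assumed write-through update so cache and database stay in step.
        self._db[key] = value
        self._cache[key] = value


database = {"job-1": "draft text for job 1"}
servlet = CachingServlet(database)
first = servlet.get("job-1")     # served from the database
second = servlet.get("job-1")    # served from the cache
servlet.put("job-1", "edited text for job 1")
```

In this sketch repeated requests for the same record reach the database once, which is the off-loading effect the paragraph above attributes to the servlet.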

The automatic transcription device 30 may access the database 40 in the database server 24 over the data network 26 for transcribing the stored dictation. The automatic transcription device 30 uses an automatic speech recognition (ASR) device (e.g., software) to produce a draft transcription for the dictation. An example of ASR technology is the AutoScript™ product made by eScription, which also uses the speaker and, optionally, worktype identifying information to access speaker and speaker-worktype dependent ASR models with which to perform the transcription. The device 30 transmits the draft transcription over the data network 26 to the database server 24 for storage in the database and to be accessed, along with the digital audio file, by the editing device 20.

The device 30 is further configured to affect the presentation of the draft transcription. The device 30, as part of speech recognition or as part of post-processing after speech recognition, can add or change items affecting document presentation such as formats, abbreviations, and other text features. The device 30 includes a speech recognizer and may also include a post-processor for performing operations in addition to the speech recognition, although the speech recognizer itself may perform some or all of these additional functions.

The transcription device 30 is further configured to produce a token-alignment file that synchronizes the audio with the corresponding text. This file comprises a set of token records, with each record preferably containing a token, a begin index, and an end index. The token comprises a character or a sequence of characters that are to appear on the screen during a word-processing session, or one or more sounds that may or may not appear as text on a screen. A begin index comprises an array reference into the audio file corresponding to the place in the audio file where the corresponding token begins. The end index comprises an array reference into the digital audio file corresponding to the point in the audio file where the corresponding token ends. As an alternative, the end index may not exist separately, with it being assumed that the starting point of the next token (the next begin index) is also the ending point of the previous token. The transcription device 30 can store the token-alignment file in the database 40.
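
As a minimal sketch of the token records just described, and not a definition of any actual file format, each record may be modeled as a token with a begin index and an optional end index. The names TokenRecord and token_span, and the sample index values, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class TokenRecord:
    token: str                  # characters to display, or a label for a sound
    begin: int                  # index into the audio file where the token begins
    end: Optional[int] = None   # index where the token ends; None if omitted


def token_span(records: List[TokenRecord], i: int, audio_length: int) -> Tuple[int, int]:
    """Return the (begin, end) audio indexes for record i.

    When a record carries no end index, the begin index of the next record
    is taken as its end; the last record runs to the end of the audio.
    """
    rec = records[i]
    if rec.end is not None:
        return rec.begin, rec.end
    if i + 1 < len(records):
        return rec.begin, records[i + 1].begin
    return rec.begin, audio_length


records = [
    TokenRecord("the", 0, 900),
    TokenRecord("patient", 900),          # end index omitted
    TokenRecord("showed", 2400, 3100),
]
```

With these sample records, the span of "patient" is inferred from the begin index of the following token, which corresponds to the alternative in which the end index does not exist separately.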

The token-alignment file may contain further information, such as a display indicator and/or a playback indicator. The display indicator's value indicates whether the corresponding token is to be displayed, e.g., on a computer monitor, while the transcription is being edited. Using non-displayed tokens can help facilitate editing of the transcription while maintaining synchronization between on-screen tokens and the digital audio file. For example, a speaker may use an alias, e.g., for a heading, and a standard heading (e.g., Physical Examination) may be displayed while the words actually spoken by the speaker (e.g., “On exam today”) are audibly played but not displayed as text (hidden). The playback indicator's value indicates whether the corresponding token has audio associated with the token. Using the playback indicator can also help facilitate editing the transcription while maintaining synchronization between on-screen tokens and the digital audio file. The playback indicator's value may be adjusted dynamically during audio playback, e.g., by input from the transcriptionist. The adjustment may, e.g., cause audio associated with corresponding tokens (e.g., hesitation words) to be skipped partially or entirely, which may help increase the transcriptionist's productivity.
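
The effect of the display and playback indicators may be sketched as follows. This is an illustrative model under stated assumptions: the Token class, the helper names, the index values, and the choice to represent the heading alias as two separate tokens (one displayed without audio, one played without display) are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Token:
    text: str
    begin: int
    end: int
    display: bool = True    # display indicator: show the token on the monitor?
    playback: bool = True   # playback indicator: play audio for the token?


def visible_text(tokens: List[Token]) -> str:
    """Join only the tokens whose display indicator is set."""
    return " ".join(t.text for t in tokens if t.display)


def playback_segments(tokens: List[Token]) -> List[Tuple[int, int]]:
    """Collect audio ranges to play, merging ranges that touch."""
    segments: List[Tuple[int, int]] = []
    for t in tokens:
        if not t.playback or t.end <= t.begin:
            continue
        if segments and segments[-1][1] == t.begin:
            segments[-1] = (segments[-1][0], t.end)
        else:
            segments.append((t.begin, t.end))
    return segments


tokens = [
    Token("Physical Examination", 0, 0, display=True, playback=False),
    Token("On exam today", 0, 1500, display=False, playback=True),
    Token("um", 1500, 2000, display=False, playback=False),
    Token("the patient", 2000, 3200),
]
```

Here the hesitation word is skipped during playback; setting its playback indicator back to True, as a transcriptionist might do dynamically, would restore a single continuous audio range.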

The tokens stored in the token-alignment file may or may not correspond to words. Instead, a token may represent one or more characters that appear on a display during editing of the transcription, or sounds that occur in the audio file. Thus, the written transcription may have a different form and/or format than the exact words that were spoken by the person 12. For example, a token may represent conventional words such as “the,” “patient,” or “esophagogastroduodenoscopy,” multiple words, partial words, abbreviations or acronyms, numbers, dates, sounds (e.g., a cough, a yawn, a bell), absence of sound (silence), etc. For example, the speaker 12 may say “USA” and the automatic transcription device 30 may interpret and expand this into “United States of America.” In this example, the token is “United States of America” and the begin index would point to the beginning of the audio signal for “USA” and, if the token-alignment file uses end indexes, the end index would point to the end of the audio signal “USA.” As another example, the speaker 12 might say “April 2 of last year,” and the text might appear on the display as “04/02/2003.” The tokens, however, can synchronize the text “04/02/2003” with the audio of “April 2 of last year.” As another example, the speaker 12 might say “miles per hour” while the text is displayed as “MPH.” Using the tokens, the speech recognizer 30, or a post-processor in or separate from the device 30, may alter, expand, contract, and/or format the spoken words when converting to text without losing the audio synchronization. Tokens preferably have variable lengths, with different tokens having different lengths.
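
The following sketch illustrates how variable-length tokens can keep displayed text tied to audio even when the displayed form differs from the spoken words. The sample sentence, the index values, the single-space separator, and the function names are assumptions for illustration and are not taken from any actual token-alignment file.

```python
from typing import List, Optional, Tuple

# Each entry: (displayed text, begin index, end index) into the audio.
# The audio for "04/02/2003" is assumed to be the spoken "April 2 of last year",
# and the audio for "United States of America" the spoken "USA".
TOKENS: List[Tuple[str, int, int]] = [
    ("seen on", 0, 1600),
    ("04/02/2003", 1600, 5200),
    ("in the", 5200, 6400),
    ("United States of America", 6400, 8000),
]


def text_spans(tokens: List[Tuple[str, int, int]], sep: str = " ") -> List[Tuple[int, int]]:
    """Character range occupied by each token when tokens are joined with sep."""
    spans = []
    pos = 0
    for text, _, _ in tokens:
        spans.append((pos, pos + len(text)))
        pos += len(text) + len(sep)
    return spans


def audio_for_text_offset(tokens: List[Tuple[str, int, int]], offset: int) -> Optional[Tuple[int, int]]:
    """Audio range of the token containing a character offset, if any."""
    for (start, end), (_, begin, stop) in zip(text_spans(tokens), tokens):
        if start <= offset < end:
            return begin, stop
    return None
```

A lookup of this kind is one way a text-cursor position inside “04/02/2003” could be resolved to the audio in which the date was spoken.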

The token-alignment file provides an environment with many features. Items may appear on a screen but not have any audio signal associated with them (e.g., implicit titles and headings). Items may have audio associated with them and may appear on the screen but may not appear as words (e.g., numeric tokens such as “120/88”). Items may have audio associated with them, appear on the screen, and appear as words contained in the audio (e.g., “the patient showed delayed recovery”). Multiple words may appear on the screen corresponding to audio that is an abbreviated form of what appears on the screen (e.g., “United States of America” may be displayed corresponding to audio of “USA”). Items may have audio associated with them but not have corresponding symbols appear on the screen (e.g., a cough, an ending salutation such as “that's all,” commands or instructions to the transcriptionist such as “start a new paragraph,” etc.).

The editing device 20 is configured to be used by a transcriptionist to access and edit the draft transcription stored in the database of the database server 24. The editing device 20 includes a computer (e.g., display, keyboard, mouse, monitor, memory, and a processor, etc.), an attached foot-pedal, and appropriate software such as the EditScript™ software product made by eScription. The transcriptionist can request a dictation job by, e.g., clicking on an on-screen icon. The request is serviced by the database server 24, which finds the dictation for the transcriptionist, and transmits the corresponding audio file and the draft transcription text file. The transcriptionist edits the draft using the editing device 20 and sends the edited transcript back to the database server 24. For example, to end the editing the transcriptionist can click on an on-screen icon button to instruct the editing device 20 to send the final edited document to the database server 24 via the network 22, along with a unique identifier for the transcriptionist. With the data sent from the editing device 20, the database in the server 24 contains, for each dictation: a speaker identifier, a transcriptionist identifier, a file pointer to the digital audio signal, and a file pointer to the edited text document.

The edited text document can be transmitted directly to a customer's medical record system or accessed over the data network 22 from the database by the administrative console 18. The console 18 may include an administrative console software product such as Emon™ made by eScription.

Referring to FIG. 2, components of the editing device 20, e.g., a computer, include a database interaction module 40, a user interface 42, a word processor module 44, an audio playback module 46, an audio file pointer 48, a cursor module 50, a monitor 52, and an audio device 54. A computer implementing portions of the editing device 20 includes a processor and memory that stores appropriate computer-readable, computer-executable software code instructions that can cause the processor to execute appropriate instructions for performing functions described. The monitor 52 and audio device 54, e.g., speakers, are physical components while the other components shown in FIG. 2 are functional components that may be implemented with software, hardware, etc., or combinations thereof. The audio playback device 46, such as a SoundBlaster® card, is attached to the audio output transducer 54 such as speakers or headphones. The transcriptionist can use the audio device 54 (e.g., headphones or a speaker) to listen to audio and can view the monitor 52 to see the corresponding text. The transcriptionist can use the foot pedal 66, the keyboard 62, and/or the mouse 64 to control the audio playback. The database interaction, audio playback, and editing of the draft transcription are accomplished by means of the appropriate software such as the EditScript Client™ software product made by eScription. The editing software is loaded on the editing device computer 20 and configured appropriately for interaction with other components of the editing device 20. The editing software can use a standard word processing software library, such as that provided with Microsoft Word®, in order to load, edit and save documents corresponding to each dictation.

The editing software includes the database interaction module 40, the user interface module 42, the word processing module 44, the audio playback module 46, the audio file pointer adjustment module 48, and the multi-cursor control module 50. The control module 50 regulates the interaction between the interface module 42 and the word processor 44, the audio playback module 46, and the audio file pointer 48. The control module 50 regulates the flow of actions relating to processing of a transcription, including playing audio and providing cursors in the transcribed text, as discussed below especially with respect to FIG. 7. The user interface module 42 controls the activity of the other modules and includes keyboard detection 56, mouse detection 58, and foot pedal detection 60 sub-modules for processing input from a keyboard 62, a mouse 64, and a foot-pedal 66. The foot pedal 66 is a standard transcription foot pedal and is connected to the editing device computer through the computer's serial port. The foot pedal 66 preferably includes a “fast forward” portion and a “rewind” portion.

The transcriptionist can request a job from the database by selecting an on-screen icon with the mouse 64. The user interface module 42 interprets this mouse click and invokes the database interaction module 40 to request the next job from the database. The database server 24 (FIG. 1) responds by transmitting the audio data file, the draft transcription file, and the token-alignment file to the user interface module 42. With this information, the editing software can initialize a word-processing session by loading the draft text into the word processing module 44.

The audio playback module 46 is configured to play the audio file stored in the database. For initial playback, the module 46 plays the audio file sequentially. The playback module 46 can, however, jump to audio corresponding to an indicated portion of the transcription and begin playback from the indicated location. The location may be indicated by a transcriptionist using appropriate portions of the editing device 20 such as the keyboard 62, or the mouse 64 as discussed below. For playback that starts at an indicated location, the playback module 46 uses the token-alignment file to determine the location in the audio file corresponding to the indicated transcription text. Since many audio playback programs play audio in fixed-sized sections (called “frames”), the audio playback module 46 may convert the indicated begin index to the nearest preceding frame for playback. For example, an audio device 54 may play only frames of 128 bytes in length. In this example, the audio playback module uses the token-alignment file to find the nearest prior starting frame that is a multiple of 128 bytes from the beginning of the audio file. Thus, the starting point for audio playback may not correspond precisely to the selected text in the transcription.
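
The conversion of a begin index to the nearest preceding frame boundary in the 128-byte example above amounts to rounding down to a multiple of the frame size. The sketch below assumes that the begin index is a byte offset from the start of the audio file; the function and constant names are hypothetical.

```python
FRAME_SIZE = 128  # bytes per frame, as in the example above


def frame_aligned_start(begin_index: int, frame_size: int = FRAME_SIZE) -> int:
    """Round a begin index down to the nearest preceding frame boundary.

    Assumes begin_index is a byte offset from the start of the audio file.
    """
    if begin_index < 0:
        raise ValueError("begin_index must be non-negative")
    return (begin_index // frame_size) * frame_size


start = frame_aligned_start(1000)   # 7 full frames precede byte 1000
```

For a begin index of 1000, playback would start at byte 896, i.e., up to 127 bytes before the selected token begins, which is why the starting point may not correspond precisely to the selected text.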

The transcriptionist can review and edit a document by appropriately controlling portions of the editing device 20. The transcriptionist can regulate the playback using the foot pedal 66, and listen to the audio corresponding to the text as played by the playback module 46 and converted to sound by the audio device 54. Further, the transcriptionist can move a cursor to a desired portion of the display of the monitor 52 using the keyboard 62 and/or mouse 64, and can make edits at the location of the cursor using the keyboard 62 and/or mouse 64.

While the transcriptionist is editing the document, the user interface module 42 can service hardware interrupts from all three of its sub-modules 56, 58, 60. The transcriptionist can use the foot pedal 66 to indicate that the audio should be “rewound,” or “fast-forwarded” to a different time point in the dictation. These foot-pedal presses are serviced as hardware interrupts by the user interface module 42. Most standard key presses and on-document mouse-clicks are sent to the word processing module 44 to perform the document editing functions indicated and to update the monitor display. Some user interaction, however, may be directed to the audio-playback oriented modules 46, 48, 50, e.g., cursor control, audio position control, and/or volume control. The transcriptionist may indicate that editing is complete by clicking another icon. In response to such an indication, the final text file is sent through the database interaction module 40 to the database server 24.

Referring also to FIG. 3, the cursor module 50 is configured to provide an audio cursor 70 and a text cursor 72 on the monitor 52 in conjunction with the display of the draft transcription 74 for editing by the transcriptionist. The cursor module 50 provides the cursors 70 and 72 independently.

The audio cursor 70, under the control of the cursor module 50, tracks the text in the document 74 as the corresponding audio is played to help the transcriptionist follow along in the text 74 with the corresponding audio. The audio cursor 70 moves in conjunction with the audio, as linked to the text 74 by the token-alignment file, to help the transcriptionist follow the text 74 corresponding to the currently-played audio. In order to highlight the text 74, the audio cursor 70 may take a variety of different forms. For example, as shown in FIG. 3, the audio cursor provides a box 76 around the text of the token corresponding to the audio presently being played. The box 76 may also take a variety of forms to distinguish it from other portions of the document 74, such as a rectangular outline of the box 76, and/or a solid box (e.g., inverse video), and may be of a variety of colors such as red against black letters on a white background. As another example, referring to FIG. 4, the audio cursor 70 may be a box 78 that highlights the entire line (or lines) of text that includes the text of the token corresponding to the audio currently being played. The text cursor 72 could be a box 80, e.g., of a single character in width. A text cursor 73 indicates other possible features of a text cursor, including that a text cursor can highlight an entire word and can be positioned within text highlighted by the audio cursor 70. Further, FIG. 4 illustrates that more than two cursors could be provided. As another example, referring to FIG. 5, the audio cursor 70 could be a vertical line cursor 82 that highlights text, e.g., the beginning of the text of the token currently being played, or the beginning of the line of text including the token currently being played. Other possibilities include using highlighting capabilities or bold characters to transiently emphasize a word, series of words, or line(s) of text. Still other forms of the audio cursor 70 may be used. Preferably, the audio cursor 70 is precisely aligned with the currently-played audio, but the cursor 70 may approximate the audio, e.g., with groups of words or one or more entire lines of text being indicated by the audio cursor 70.

The text cursor 72 provided by the cursor module 50 indicates the current location for editing in the document 74. The transcriptionist can manipulate the keyboard 62 and/or mouse 64 to control the location of the text cursor 72. The cursor 72 indicates where editing will occur, e.g., addition of text through the keyboard 62, deletion of text, alteration of formatting, insertion of paragraph or page breaks, etc. The transcriptionist can edit the document using the text cursor 72 in standard fashion. The text cursor 72 in combination with the audio cursor 70, however, provides for multi-tasking by the transcriptionist. To make edits, the transcriptionist positions the text cursor 72 in standard fashion and makes the desired change(s).

Edits to the text 74 can be made without losing synchronization with the audio. Changes to the text 74 are tracked, with records being made of which characters or other edits are inserted and where, and which characters or other features (e.g., editing, page breaks, etc.) are removed. Preferably, the word processor 44 implements a track-changes feature, maintaining the original document and storing indications of changes.
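One way such change records could preserve synchronization is sketched below, purely as an illustration. The Change record and the functions apply_changes and to_original_offset are hypothetical names, and the sketch assumes that changes are stored as non-overlapping replacements expressed in offsets of the original draft text. Under those assumptions, any position in the edited text can be mapped back to a position in the original text, which is the text that the alignment with the audio refers to.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    """One tracked change (hypothetical record layout for this sketch)."""
    orig_pos: int   # offset in the ORIGINAL text where the change applies
    deleted: int    # number of original characters removed at orig_pos
    inserted: str   # text inserted in their place (may be empty)


def apply_changes(original: str, changes: list[Change]) -> str:
    """Rebuild the edited text from the original text and the change list."""
    out, cursor = [], 0
    for ch in sorted(changes, key=lambda c: c.orig_pos):
        out.append(original[cursor:ch.orig_pos])
        out.append(ch.inserted)
        cursor = ch.orig_pos + ch.deleted
    out.append(original[cursor:])
    return "".join(out)


def to_original_offset(edited_pos: int, changes: list[Change]) -> int:
    """Map an offset in the edited text back to an offset in the original.

    A position inside inserted text maps to the original offset at which
    the insertion was made.
    """
    shift = 0  # (edited offset - original offset) accumulated so far
    for ch in sorted(changes, key=lambda c: c.orig_pos):
        start_in_edited = ch.orig_pos + shift
        if edited_pos < start_in_edited:
            break
        if edited_pos < start_in_edited + len(ch.inserted):
            return ch.orig_pos
        shift += len(ch.inserted) - ch.deleted
    return edited_pos - shift


if __name__ == "__main__":
    draft = "the patient has fever"
    edits = [Change(orig_pos=16, deleted=0, inserted="a high ")]
    print(apply_changes(draft, edits))
    print(to_original_offset(23, edits))
```

Because the original text is never modified in this sketch, the change list alone is enough both to rebuild the edited document and to relate any edited position to the original draft.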

The track-changes feature implemented by the word processor 44 produces a file of changes (e.g., textual, formatting, etc.) to the original text 74. The information regarding these changes, especially text changes such as different expansions of abbreviations, different spellings, etc., may be used to adapt the speech recognizer 30. In conjunction with the synchronization information provided by the automatic transcription device 30 by means of the token-alignment file, the file of changes provides a useful tool for continuous learning/improvement of speech models used for speech recognition by the automatic transcription device 30.

The text cursor 72 may be used to change the location of the audio cursor 70, and thus the audio currently played, through commands, e.g., from the keyboard 62 and/or the mouse 64, implemented by the cursor control module 50. Movement to a different part of the audio is typically implemented by the audio file pointer module 48 by incrementing or decrementing a pointer into the digital audio file. The location of the audio cursor 70 and thus the current audio for playback, however, may be changed using the text cursor 72. The transcriptionist can position the text cursor 72 to the desired portion of the text 74 for audio playback and actuate appropriate commands. For example, the transcriptionist may use one or more hot keys (e.g., a sequence of keys) and/or one or more mouse clicks (e.g., on screen icons) to cause the audio cursor 70 to move to the position of the text cursor 72, with the audio file pointer being adjusted accordingly. The correct position in the audio file is determined by the audio file pointer module 48 by finding the corresponding token in the token-alignment file. The corresponding token may be a nearest, preferably preceding, token that is associated with text in the document 74. Thus, if the transcriptionist attempts to position the audio cursor 70 in text that was added after speech recognition, e.g., added by the transcriptionist, then the audio file pointer module 48 uses track-changes information from the word processor 44 to determine the appropriate token. The module 48 determines that the text at the position of the text cursor 72 is not in the token-alignment file, and finds the token in the token-alignment file that is nearest, and preferably preceding, the inserted text using information regarding the original document from the track-changes information.
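The lookup of the nearest, preferably preceding, token can be sketched as follows. This is an illustrative sketch only: the Token fields and the function token_for_text_offset are hypothetical, the token list is assumed to be sorted by text position, and the cursor offset is assumed to have already been expressed in coordinates of the original draft text (e.g., using the track-changes information).

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """One entry of a token-alignment file (hypothetical layout)."""
    text_start: int   # character offset of the token's text in the draft
    text: str         # the transcribed text of the token
    audio_begin: int  # begin index into the audio file
    audio_end: int    # end index into the audio file


def token_for_text_offset(tokens: list[Token], offset: int) -> Token:
    """Return the token at or nearest before the given draft-text offset.

    tokens must be non-empty and sorted by text_start. If the offset lies
    before the first token, the first token is returned as the nearest one.
    """
    starts = [t.text_start for t in tokens]
    i = bisect_right(starts, offset) - 1
    return tokens[max(i, 0)]


if __name__ == "__main__":
    alignment = [
        Token(0, "the", 0, 2400),
        Token(4, "patient", 2400, 7100),
        Token(12, "has", 7100, 9000),
    ]
    # A text cursor at offset 5 (inside "patient") selects that token,
    # and the audio file pointer would be set to its audio_begin.
    print(token_for_text_offset(alignment, 5).audio_begin)
```

A binary search is used here only because the token list is assumed to be sorted; a linear scan over the token-alignment file would give the same result.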

The text cursor 72 may also be moved to the position of the audio cursor 70. For example, one or more hot keys and/or one or more mouse clicks can be used to cause the text cursor 72 to jump from its current position to a position at, adjacent, or near the position of the audio cursor 70. Thus, for example, if the transcriptionist hears audio and recognizes that the highlighted corresponding text should be edited, then the transcriptionist can cause the text cursor 72 to jump to the location of the audio cursor 70 to quickly position the text cursor 72 for editing of the desired text. Preferably, the text cursor 72 can highlight the text highlighted by the audio cursor 70 such that text entered by the transcriptionist will overwrite the highlighted text, obviating deletion of the text by the transcriptionist and thereby saving time.
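The jump-and-overwrite behavior described above can be illustrated with a small model of editor state. The class EditorState and its methods are hypothetical and stand in for whatever the cursor control module and word processor actually maintain; the sketch assumes both cursors are represented as (start, end) character spans and, for brevity, does not re-align the audio span after an edit.

```python
from dataclasses import dataclass


@dataclass
class EditorState:
    """Minimal model of a document with two independent cursors."""
    text: str
    audio_span: tuple[int, int]  # text highlighted by the audio cursor
    selection: tuple[int, int]   # text cursor; start == end is a plain caret

    def jump_text_cursor_to_audio(self) -> None:
        """Make the text cursor select the text under the audio cursor."""
        self.selection = self.audio_span

    def type_text(self, new_text: str) -> None:
        """Insert text at the text cursor, overwriting any selected text."""
        start, end = self.selection
        self.text = self.text[:start] + new_text + self.text[end:]
        caret = start + len(new_text)
        self.selection = (caret, caret)


if __name__ == "__main__":
    state = EditorState(
        text="patient has a favor today",
        audio_span=(14, 19),   # the audio cursor highlights "favor"
        selection=(0, 0),      # the text cursor is elsewhere
    )
    state.jump_text_cursor_to_audio()  # e.g., triggered by a hot key
    state.type_text("fever")           # overwrites the highlighted word
    print(state.text)
```

In this sketch a single command followed by typing replaces the misrecognized word, with no separate deletion step.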

In operation, referring to FIG. 6, with further reference to FIGS. 1-3, a process 90 for producing and editing a transcription of speech using the system 10 includes the stages shown. The process 90, however, is exemplary only and not limiting. The process 90 may be altered, e.g., by having stages added, removed, or rearranged.

At stage 92, the speaker 12 dictates desired speech to be converted to text. The speaker can use, e.g., a hand-held device such as a personal digital assistant, to dictate audio that is transmitted over the network 14 to the voice mailbox 16. The audio is stored in the voice mailbox 16 as an audio file. The audio file is transmitted over the network 22 to the database server 24 and is stored in the database 40.

At stage 94, the automatic transcription device 30 transcribes the audio file. The device 30 accesses and retrieves the audio file from the database 40 through the LAN 26. A speech recognizer of the device 30 analyzes the audio file in accordance with speech models to produce a draft text document 74 from the audio file and store the draft document 74 in the database 40. The device 30 also produces a corresponding token-alignment file that includes the draft document 74 and associates portions of the audio file with the transcribed text of the document 74. The device 30 stores the token-alignment file in the database 40 via the LAN 26.

At stage 96, the transcriptionist reviews and edits the transcribed draft document 74 as appropriate. The transcriptionist uses the editing device 20 to access the database 40 and retrieve the audio file and the token-alignment file that includes the draft text document 74. The transcriptionist plays the audio file and reviews the corresponding text as highlighted or otherwise indicated by the audio cursor 70 and makes desired edits using the text cursor 72. The reviewing of this stage is detailed below with respect to FIG. 7. The word processor 44 produces and stores track-changes information in response to edits made by the transcriptionist.

At stage 98, the track-changes information is provided to the automatic transcription device 30 for use in improving the speech models used by the speech recognizer of the device 30 by analyzing the transcribed draft text and what revisions were made by the transcriptionist. The models can be adjusted so that the next time the speech recognizer analyzes speech that was edited by the transcriptionist, the recognizer will transcribe the same or similar audio to the edited text instead of the draft text previously provided. At stage 100, the word processor provides a final, revised text document as edited by the transcriptionist. This final document can be stored in the database 40 and provided via the network 22 to interested parties, e.g., the speaker that dictated the audio file.

Referring to FIG. 7, with further reference to FIGS. 1-3 and 6, a process 110 for reviewing the draft transcribed document 74, stage 96 of FIG. 6, using the editing device 20 includes the stages shown. The process 110, however, is exemplary only and not limiting. The process 110 may be altered, e.g., by having stages added, removed, or rearranged.

At stage 112, a token in the token-alignment file is obtained. The next token in the file is obtained in the normal course of audio playback in the absence of transcriptionist input. If, however, the transcriptionist causes a change in the location of the audio cursor, then the token corresponding to the new location of the audio cursor is obtained.

At stage 114, the text most nearly associated with the current token is located. This text may be text associated with a token adjacent to the current token, e.g., if the current token does not have text directly associated with it (e.g., a cough). Text entered by the transcriptionist is ignored in determining the most-nearly-associated text.
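A minimal sketch of locating the most-nearly-associated text for a token that has no text of its own is given below. The Token layout and the function nearest_text_token are hypothetical. The sketch assumes that a token without text (e.g., a cough) is marked with a text value of None, and it assumes a preference for the preceding token when a preceding and a following token are equally near; that tie-breaking rule is an assumption of the sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Token:
    """One entry of a token-alignment file (hypothetical layout)."""
    text: Optional[str]  # None for tokens with no text, e.g., a cough
    audio_begin: int
    audio_end: int


def nearest_text_token(tokens: list[Token], index: int) -> Optional[int]:
    """Return the index of the token whose text is nearest to tokens[index].

    If the current token has text, its own index is returned. Otherwise the
    search widens one step at a time, checking the preceding neighbor before
    the following one. None is returned if no token has text.
    """
    if tokens[index].text is not None:
        return index
    for distance in range(1, len(tokens)):
        for candidate in (index - distance, index + distance):
            if 0 <= candidate < len(tokens) and tokens[candidate].text is not None:
                return candidate
    return None


if __name__ == "__main__":
    alignment = [
        Token("the", 0, 2400),
        Token(None, 2400, 5000),      # a cough: audio but no text
        Token("patient", 5000, 9700),
    ]
    # While the cough plays, the audio cursor stays on "the".
    print(nearest_text_token(alignment, 1))
```

Because only entries of the token-alignment file are consulted, text typed by the transcriptionist plays no part in the result, consistent with the stage described above.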

At stage 116, the cursor control module 50 displays the audio cursor 70 to accentuate the text determined to be most nearly associated with the current token. The control module 50 draws the audio cursor 70 to highlight the text, e.g., drawing the cursor 70 around, near, etc., the determined text. The location of the text corresponding to tokens may be determined dynamically as the token-alignment file is stepped through in order to display the audio cursor 70. Alternatively, locations (e.g., within a document or on a screen) for tokens can be determined before stepping through the token-alignment file to play back the audio (e.g., upon loading of the token-alignment file). In this alternative, the locations can be re-calculated for added or removed text (on the fly when the text is changed, after changes are made, in response to a re-determine command, etc.). Other alternatives are also possible.

At stage 118, the audio file pointer module 48 determines the position in the audio file corresponding to the current token. The module 48 uses the token-alignment file and the selected token to find the location in the audio file corresponding to the current token.

At stage 120, the audio file pointer module 48 selects a portion of the audio file for playback. The module 48 selects a frame of audio associated with the token for submission to the audio playback module 46.

At stage 122, the audio playback module 46 controls playback of the selected audio frame. The module 46 provides control signals to the audio device 54 to audibly play the corresponding audio for the transcriptionist to hear.

Referring to FIG. 8, with further reference to FIGS. 1-3 and 6-7, a process 130 for editing the draft transcribed document 74, stage 96 of FIG. 6, using the editing device 20 includes the stages shown. The process 130, however, is exemplary only and not limiting. The process 130 may be altered, e.g., by having stages added, removed, or rearranged.

At stage 132, the transcriptionist positions the text cursor 72 as desired for editing of the document 74. The transcriptionist can move the text cursor 72 independently of the audio cursor 70, e.g., using the keyboard 62 and/or mouse 64. The transcriptionist may also, or alternatively, move the text cursor 72 dependent upon the audio cursor 70 by causing the text cursor 72 to move to, or near to, the position of the audio cursor 70.

At stage 134, the audio corresponding to the location of the text cursor 72 is played if the audio cursor 70 is synched to the text cursor 72. If the transcriptionist causes the audio cursor 70 to move to the location of the text cursor 72, then the audio for the new location of the audio cursor 70 is preferably played to assist the transcriptionist in determining whether edits to the text are desired.

At stage 136, desired edits to the text 74 at the location of the text cursor 72 are made by the transcriptionist. With the text cursor 72 placed as desired, edits can be made as indicated by the transcriptionist (e.g., using the keyboard 62) and implemented by the word processor 44. The audio may continue to play while changes are being made at the location of the text cursor 72. The transcriptionist may, however, stop the audio playback using, e.g., the foot pedal 66, keyboard commands, etc. The audio playback may be managed independently of editing of the text 74.

Other embodiments are within the scope and spirit of the appended claims. For example, due to the nature of software, functions described above can be implemented using software, hardware, firmware, hardwiring, or combinations of any of these. Features implementing functions may also be physically located at various positions, including being distributed such that portions of functions are implemented at different physical locations. Further, while two cursors were discussed above, more than two cursors could be employed and implemented by the cursor control module 50. For example, there could be an audio cursor and multiple text cursors, e.g., one controlled by the mouse 64 and one controlled by the keyboard 62. Other arrangements and numbers of cursors could be implemented.

What is claimed is:
 1. A device for use by a transcriptionist in a transcription editing system for editing transcriptions dictated by speakers, the device comprising, in combination: a monitor configured to display visual text of transcribed dictations; an audio mechanism configured to cause playback of portions of an audio file associated with a dictation; and a cursor-control module coupled to the audio mechanism and to the monitor and configured to cause the monitor to display multiple cursors in the text.
 2. The device of claim 1 wherein the cursor-control module is configured to cause the monitor to display multiple cursors in the text that indicate different functionality.
 3. The device of claim 2 wherein the cursor-control module is configured to cause the monitor to display: an audio cursor accentuating a portion of the text, the audio cursor accentuating different text as the audio file is played using the audio mechanism; and a text cursor indicative of a position in the text where editing commands will be implemented.
 4. The device of claim 3 wherein the audio cursor comprises at least one of a rectangular box surrounding text corresponding to a portion of the audio file, a rectangular box surrounding a line of text, a vertical line, an inverse-video portion of the monitor, and bolding of a portion of the text.
 5. The device of claim 3 wherein the cursor-control module is configured to determine where to cause the monitor to display the audio cursor by using a token-alignment file that associates portions of the audio file with portions of the text.
 6. The device of claim 3 wherein the cursor-control module is configured to move at least one of the audio cursor and the text cursor to a location of the other of the text cursor and the audio cursor, respectively.
 7. The device of claim 6 wherein the audio mechanism is configured to determine and play a portion of the audio file corresponding to text at the location of the audio cursor when the audio cursor is moved to the location of the text cursor.
 8. The device of claim 1 further comprising a change-recording apparatus configured to record changes made to the text and associate the changes with portions of the audio file whereby the recorded changes can be used to adapt speech recognition apparatus in accordance with the changed text and the associated portions of the audio file.
 9. A computer program product residing on a computer-readable medium and comprising computer-readable instructions for causing a computer to: display visual text of transcribed dictations; cause playback of portions of an audio file associated with a dictation; and cause the monitor to display multiple cursors in the text.
 10. The computer program product of claim 9 wherein the instructions are configured to cause the monitor to display: an audio cursor accentuating a portion of the text with the audio cursor accentuating different text as the audio file is played; and a text cursor indicative of a position in the text where editing commands will be implemented.
 11. The computer program product of claim 10 wherein the cursor-control module is configured to determine where to cause the monitor to display the audio cursor by using a token-alignment file that associates portions of the audio file with portions of the text.
 12. The computer program product of claim 10 further comprising instructions for causing the computer to move at least one of the audio cursor and the text cursor to a location of the other of the text cursor and the audio cursor, respectively.
 13. The computer program product of claim 12 further comprising instructions for causing the computer to determine and cause playing of a portion of the audio file corresponding to text at the location of the audio cursor when the audio cursor is moved to the location of the text cursor.
 14. The computer program product of claim 9 further comprising instructions for causing the computer to record changes made to the text and associate the changes with portions of the audio file whereby the recorded changes can be used to adapt speech recognition apparatus in accordance with the changed text and the associated portions of the audio file.
 15. A method of processing text transcribed from an audio file, the method comprising: displaying text of a transcribed dictation on a monitor; playing portions of an audio file associated with the dictation; displaying an audio cursor in the text on the monitor, the audio cursor accentuating a portion of the text with the audio cursor accentuating different text as the audio file is played; and displaying a text cursor in the text on the monitor, the text cursor being indicative of a position in the text where editing commands will be implemented.
 16. The method of claim 15 further comprising using a token-alignment file that associates portions of the audio file with portions of the text to determine where to display the audio cursor.
 17. The method of claim 15 further comprising moving at least one of the audio cursor and the text cursor to a location of the other of the text cursor and the audio cursor, respectively, in response to receiving a corresponding command.
 18. The method of claim 17 further comprising playing of a portion of the audio file corresponding to text at the location of the audio cursor if the audio cursor is moved to the location of the text cursor.
 19. The method of claim 15 further comprising: recording changes made to the text; and associating the changes with portions of the audio file.
 20. The method of claim 19 further comprising using the recorded changes to adapt speech recognition apparatus in accordance with the changed text and the associated portions of the audio file.
 21. A method of processing a recorded dictation, the method comprising: analyzing the recorded dictation in accordance with speech models to convert the recorded dictation to a draft text; storing the draft text; and producing and recording a token-alignment file that associates portions of the draft text with portions of the audio file, the token-alignment file including tokens at least some of which are indicative of portions of the draft text, the tokens indicating beginnings and ends of portions of the recorded dictation associated with the portions of the draft text such that the portions of the recorded dictation are associated with corresponding portions of the draft text even if the corresponding portions of the draft text, if spoken, do not correspond identically to the corresponding portions of the recorded dictation.
 22. The method of claim 21 wherein producing and recording the token-alignment file includes producing and recording tokens for which there is no corresponding draft text.
 23. The method of claim 21 further comprising: receiving a revised text associated with the recorded dictation; and using indicia of differences between the revised text and the draft text and the associated recorded dictation to modify the speech models for converting other recorded dictations to other draft texts. 